NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth

https://doi.org/10.4230/LIPIcs.WABI.2021.4

Fan, Jason; Chan, Skylar; Patro, Rob (July 2021, 21stInternational Workshop on Algorithms in Bioinformatics (WABI 2021).)
Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpin gene expression based analysis routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. Thus, we derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. To our knowledge, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
more » « less
Full Text Available
The Most Parsimonious Reconciliation Problem in the Presence of Incomplete Lineage Sorting and Hybridization Is NP-Hard

https://doi.org/10.4230/LIPIcs.WABI.2021.1

LeMay, Matthew; Wu, Yi-Chieh; Libeskind-Hadas, Ran (January 2021, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021))
Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
The maximum parsimony phylogenetic reconciliation problem seeks to explain incongruity between a gene phylogeny and a species phylogeny with respect to a set of evolutionary events. While the reconciliation problem is well-studied for species and gene trees subject to events such as duplication, transfer, loss, and deep coalescence, recent work has examined species phylogenies that incorporate hybridization and are thus represented by networks rather than trees. In this paper, we show that the problem of computing a maximum parsimony reconciliation for a gene tree and species network is NP-hard even when only considering deep coalescence. This result suggests that future work on maximum parsimony reconciliation for species networks should explore approximation algorithms and heuristics.
more » « less
Full Text Available
Parsimonious Clone Tree Reconciliation in Cancer

https://doi.org/10.4230/LIPIcs.WABI.2021.9

Sashittal, Palash; Zaccaria, Simone; El-Kebir, Mohammed (January 2021, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021))
Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
Full Text Available
Compressing and Indexing Aligned Readsets

https://doi.org/10.4230/LIPIcs.WABI.2021.13

Gagie, Travis; Gourdel, Garance; Manzini, Giovanni (January 2021, Workshop on Algorithms in Bioinformatics (WABI))
Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads’ starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separators characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
more » « less
Full Text Available
Fast Approximate Shortest Hyperpaths for Inferring Pathways in Cell Signaling Hypergraphs

https://doi.org/10.4230/LIPIcs.WABI.2021.20

Krieger, Spencer; Kececioglu, John (January 2021, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021))
Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
Cell signaling pathways, which are a series of reactions that start at receptors and end at transcription factors, are basic to systems biology. Properly modeling the reactions in such pathways requires directed hypergraphs, where an edge is now directed between two sets of vertices. Inferring a pathway by the most parsimonious series of reactions then corresponds to finding a shortest hyperpath in a directed hypergraph, which is NP-complete. The state of the art for shortest hyperpaths in cell-signaling hypergraphs solves a mixed-integer linear program to find an optimal hyperpath that is restricted to be acyclic, and offers no efficiency guarantees. We present for the first time a heuristic for general shortest hyperpaths that properly handles cycles, and is guaranteed to be efficient. Its accuracy is demonstrated through exhaustive experiments on all instances from the standard NCI-PID and Reactome pathway databases, which show the heuristic finds a hyperpath that matches the state-of-the-art mixed-integer linear program on over 99% of all instances that are acyclic. On instances where only cyclic hyperpaths exist, the heuristic surpasses the state-of-the-art, which finds no solution; on every such cyclic instance, enumerating all possible hyperpaths shows that the solution found by the heuristic is in fact optimal.
more » « less
Full Text Available

Search for: All records